Data Summary
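# Load the packages used below: tidyr (pivot_longer), dplyr (%>% and filter), ggplot2, and plotly
library(tidyr)
library(dplyr)
library(ggplot2)
library(plotly)
income <- read.csv("https://ecoleman451.github.io/website/Data%20Visualization/Datasets/income_per_person.csv")
life <- read.csv("https://ecoleman451.github.io/website/Data%20Visualization/Datasets/life_expectancy_years.csv")
# Reshape each data set to long format with three columns (geo, year, and the measured value)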
new_income <- pivot_longer(income, cols = -geo, names_to = "year", values_to = "income")
new_life <- pivot_longer(life, cols = -geo, names_to = "year", values_to = "life.expectancy")
## Create new data set
LifeExpIncom <- merge(new_life, new_income, by = c("geo", "year"))
## Read in More Data
country <- read.csv("https://ecoleman451.github.io/website/Data%20Visualization/Datasets/countries_total.csv")
pop <- read.csv("https://ecoleman451.github.io/website/Data%20Visualization/Datasets/population_total.csv")
new_pop <- pivot_longer(pop, cols = -geo, names_to = "year", values_to = "population")
## Merge LifeExpIncom with Country
merged <- merge(LifeExpIncom, country, by.x = "geo", by.y = "name", all.x = TRUE)
## Merge Population with Merged Data
fin_data <- merge(new_pop, merged, by = c("geo", "year"), all.x = TRUE)
## Get Data for Year 2000
final_data <- subset(fin_data, year =="X2000")
We first read in two datasets, “income” and “life,” which hold income
per person and life expectancy values over many years. “income” has 193
observations and 220 variables, while “life” has 187 observations and
220 variables. Next, we reshape both datasets to long format with only
three columns: Geo, Year, and either Income or Life Expectancy. We then
merge the reshaped sets into a dataset called “LifeExpIncom,” which
contains Geo, Year, Income, and Life Expectancy (40953 observations and
4 variables). Next, we read in two more datasets: “country” (240
observations and 11 variables) and “pop” (195 observations and 220
variables), representing country and population data, respectively. We
reshape “pop” in the same way so that it aligns with “LifeExpIncom,”
where Year is already a single column. After this, we merge
“LifeExpIncom” with “country” and then merge the reshaped “pop” set with
this combined set, creating a dataset called “fin_data” (42705
observations and 15 variables). Finally, we subset the data to the year
2000, resulting in our “final_data” set (195 observations and 15
variables).
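The reshape-then-merge steps described above can be sketched on a tiny example. The country names and values below are made up purely for illustration; only `pivot_longer()` and `merge()` come from the original code. In a wide table with one `geo` column and one column per year, the long version has (number of countries) x (number of year columns) rows, which is consistent with the counts reported above (for instance, 187 x 219 = 40953).

```r
library(tidyr)

# Hypothetical wide tables: one row per country, one column per year
income_wide <- data.frame(
  geo = c("Aland", "Borduria"),
  X2000 = c(1000, 5000),
  X2001 = c(1100, 5200)
)
life_wide <- data.frame(
  geo = c("Aland", "Borduria"),
  X2000 = c(60, 75),
  X2001 = c(61, 75.5)
)

# Wide to long: every year column becomes a row, leaving geo, year, and the value
income_long <- pivot_longer(income_wide, cols = -geo, names_to = "year", values_to = "income")
life_long <- pivot_longer(life_wide, cols = -geo, names_to = "year", values_to = "life.expectancy")

# Join on country and year, keeping only pairs present in both tables
combined <- merge(life_long, income_long, by = c("geo", "year"))

dim(combined)  # 2 countries x 2 years = 4 rows, 4 columns
```

Checking `dim()` after each step in this way is a quick guard against a merge that silently drops or duplicates rows.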
GGPlot
The scatter plot below shows the relationship between income, life
expectancy, and population size across different regions in the year
2000. Each point represents a country, with the size of the point
corresponding to that country’s population. The points are color-coded
by region.
scatter_pop <- ggplot(final_data, aes(x = life.expectancy, y = income, color = region, size = population)) +
  geom_point() +
  labs(title = "Life Expectancy vs. Income per Region (2000)",
       x = "Life Expectancy",
       y = "Income",
       size = "Population",
       color = "Region")
scatter_pop
From the plot, we observe a slightly positive correlation between income
and life expectancy: countries with higher incomes tend to have longer
life expectancies. Additionally, countries in the Americas and Asia tend
to have larger populations, as indicated by the larger point sizes. This
may hint that more populous countries have longer life expectancies,
although the plot alone does not establish that. European countries
appear to have the longest life expectancies, with most of their points
on the far right side of the graph, although their populations are not
as large as those of other regions.
Next, we subset the data to focus on the year 2015, which again gives a
“final_data” set of 195 observations and 15 variables. Now, let’s
examine the overall summary statistics for the dataset “fin_data,” which
includes data from all years, not just 2015.
## Get Data for Year 2015
final_data <- subset(fin_data, year =="X2015")
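The subsets above match on strings such as "X2000" and "X2015" rather than on numbers. A likely reason, assuming the source CSV files use plain years as column headers, is that `read.csv()` by default rewrites headers into syntactically valid R names, which prefixes an "X" to names that start with a digit. The small inline CSV below is hypothetical and only illustrates that behavior and one way to recover numeric years.

```r
# Hypothetical CSV text with numeric year headers
csv_text <- "geo,2000,2015\nAland,1000,1500\nBorduria,5000,7000"

# Default check.names = TRUE turns "2000" into "X2000"
wide <- read.csv(text = csv_text)
names(wide)  # "geo" "X2000" "X2015"

# Strip the leading "X" to get integer years for filtering or animation frames
year_labels <- names(wide)[-1]
years <- as.integer(sub("^X", "", year_labels))
years  # 2000 2015
```

Anchoring the pattern with `^X` removes only a leading "X", which is slightly safer than replacing every "X" in the label.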
Plotly
The plot below shows the relationship between income, life
expectancy, and population size across different regions over several
years. Each point represents a country, with the size of the points
corresponding to the population size of that specific country. The
countries are color-coded by region for better visualization. To keep
the largest countries from dominating the plot, we apply a logarithmic
transformation to the population before mapping it to point size. This
transformation compresses the range of population sizes, which keeps the
points smaller and makes the plot clearer and easier to interpret.
Additionally, the x-axis uses a logarithmic scale to better visualize
the wide range of income values. The plot is animated to show how these
relationships change over time, providing a dynamic view of global
trends in income, life expectancy, and population:
pal.IBM <- c("#332288", "#117733", "#0072B2","#D55E00", "#882255")
pal.IBM <- setNames(pal.IBM, c("Asia", "Europe", "Africa", "Americas", "Oceania"))
# Use all years from fin_data so that each year becomes an animation frame
# (final_data so far holds only the 2015 subset)
final_data <- fin_data %>%
  filter(!is.na(region)) # Remove rows with NA in the region column
# Convert year labels such as "X2015" to numeric, then remove rows with NA values
final_data$year <- as.numeric(gsub("X", "", final_data$year))
final_data <- final_data %>%
  filter(!is.na(life.expectancy) & !is.na(income) & !is.na(population))
fig <- final_data %>%
  plot_ly(
    x = ~income,
    y = ~life.expectancy,
    size = ~(2*log(population)-11)^2, # log-based transform of population
    color = ~region,
    colors = pal.IBM, # custom colors
    frame = ~year, # the time variable that drives the animation frames
    text = ~paste("Country:", geo,
                  "<br>Region:", region,
                  "<br>Year:", year,
                  "<br>Life Expectancy:", life.expectancy,
                  "<br>Population:", population,
                  "<br>Income per Person:", income),
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )
fig <- fig %>% layout(
  title = "Income vs. Life Expectancy Over Time",
  # A single xaxis list carries both the title and the log scale
  xaxis = list(title = "Income per Person (Log Scale)", type = "log"),
  yaxis = list(title = "Life Expectancy")
)
fig
The x-axis represents income per person for each country on a
logarithmic scale, with higher incomes positioned further to the right.
The y-axis represents life expectancy, with higher values positioned
higher on the axis. Some countries stand out in the scatter plot because
of their larger population sizes and higher incomes. This visualization
allows us to analyze whether countries with higher incomes generally
have longer life expectancies. We can also examine whether population
size and income levels are correlated, which helps to identify trends
and patterns in the data.
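The visual impression of an income and life expectancy relationship can also be checked numerically. The sketch below is not part of the original analysis and uses made-up values for five hypothetical countries; with the real data, the same two `cor()` calls could be applied to one year of `fin_data`. Income is log-transformed for the Pearson correlation to mirror the log-scaled x-axis, while the Spearman correlation depends only on rank order.

```r
# Made-up values for illustration only (not taken from the datasets above)
toy <- data.frame(
  income = c(800, 2500, 9000, 30000, 60000),
  life.expectancy = c(52, 61, 70, 77, 81)
)

# Pearson correlation on log10(income), matching the log-scaled axis
pearson_log <- cor(log10(toy$income), toy$life.expectancy, use = "complete.obs")

# Spearman rank correlation, unaffected by monotone transformations of income
spearman <- cor(toy$income, toy$life.expectancy, method = "spearman", use = "complete.obs")

pearson_log
spearman
```

The `use = "complete.obs"` argument drops rows with missing values, which matters for the real data where some countries lack income or life expectancy in some years.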
In the animated plot, each frame corresponds to a different year,
showing how the relationship between income, life expectancy, and
population size evolves over time. The size of each point is determined
by the population of the country, with larger points indicating larger
populations. The color of the points indicates the region to which the
country belongs, allowing us to see regional trends and differences more
clearly. By observing the animation, we can identify how economic and
health outcomes have changed across different regions and time periods,
providing insights into global development patterns.
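To see how strongly the population transformation compresses point sizes, the expression from the `plot_ly()` call, `(2*log(population)-11)^2`, can be evaluated on its own. The helper name `marker_size` and the sample populations are assumptions made for this sketch.

```r
# Same expression as the size argument in plot_ly(); log() is the natural log
marker_size <- function(population) (2 * log(population) - 11)^2

# Hypothetical populations spanning five orders of magnitude
pops <- c(1e4, 1e6, 1e8, 1e9)
sizes <- marker_size(pops)
data.frame(population = pops, size = sizes)

# Compare the spread before and after the transformation
range_ratio_raw <- max(pops) / min(pops)
range_ratio_scaled <- max(sizes) / min(sizes)
c(raw = range_ratio_raw, scaled = range_ratio_scaled)
```

One caveat of this particular formula: it reaches zero at a population of exp(5.5), roughly 245, and grows again below that, so it only orders countries correctly when every population is above that value.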
